[DataFrame] Implement quantile#1992
Conversation
|
Test PASSed. |
kunalgosar
left a comment
There was a problem hiding this comment.
Looks really good, just a few comments. I wasn't able to reproduce a ValueError from the underlying pandas call. Can you describe when this would occur?
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Use axis = pd.DataFrame()._get_axis_number(axis) instead.
There was a problem hiding this comment.
What would this do here? I'm checking if q is a list or not
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
numeric_only specifies whether or not to filter the non-numeric columns. Does not need error checking.
There was a problem hiding this comment.
If numeric_only is true, then pandas returns value errors for mixed dtype dataframes
There was a problem hiding this comment.
I'm not able to reproduce, seems that pandas drops non-numeric columns when numeric_only is True.
In [3]: df
Out[3]:
col1 col2
0 1 a
1 2 b
2 3 c
In [4]: df.dtypes
Out[4]:
col1 int64
col2 object
dtype: object
In [5]: df.quantile(numeric_only=True)
Out[5]:
col1 2.0
Name: 0.5, dtype: float64
There was a problem hiding this comment.
This is true also in the axis=1 case.
In [6]: df.quantile(numeric_only=True, axis=1)
Out[6]:
0 1.0
1 2.0
2 3.0
Name: 0.5, dtype: float64
There was a problem hiding this comment.
Whoops, when numeric_only is FALSE, it returns TYPEerrors.
Try on
df = pd.DataFrame({"A": [1, 2,
"B": [2., 3., 4.],
"C": pd.date_range('20130101', periods=3),
"D": ['foo', 'bar', 'baz']})
df.quantile(.5, axis=1, numeric_only=False)
This returns a TypeError
There was a problem hiding this comment.
This is true, although the check here does not handle this case. The error I see is TypeError: Cannot compare type 'Timestamp' with type 'float'.
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Use the pandas _check_percentile method here.
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
When would this ValueError be thrown?
There was a problem hiding this comment.
This exception is thrown when there are only non-numeric columns in this partition
Added comment
There was a problem hiding this comment.
I see, this should probably be handled somewhere to ensure concordance.
There was a problem hiding this comment.
Describe also handles it like this, so if we do something about it we should make sure to modify it there as well
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
_arithmetic_helper should return a Series.
|
Test PASSed. |
|
Test PASSed. |
kunalgosar
left a comment
There was a problem hiding this comment.
Looks much better; should probably look into the right error handling here. Seems like ignoring errors on workers is the wrong solution.
It would be better if the errors were caught in the function logic, before submitting remote tasks. Since this is an issue in describe can you update there too?
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
This is true, although the check here does not handle this case. The error I see is TypeError: Cannot compare type 'Timestamp' with type 'float'.
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
As discussed above, TypeErrors are thrown here. Perhaps you can add some logic at the beginning of the function to ensure all the dtypes are comparable.
The issue seems to arise not from non-numeric data, but from non-comparable types.
|
Updated error checking and dtyping |
|
Test PASSed. |
kunalgosar
left a comment
There was a problem hiding this comment.
Few preliminary comments.
python/ray/dataframe/__init__.py
Outdated
There was a problem hiding this comment.
The lines should be consolidated.
There was a problem hiding this comment.
Wanted to keep all the datetime imports on their own line, should I still do this?
python/ray/dataframe/__init__.py
Outdated
There was a problem hiding this comment.
If you do this, you need to remove test_series.
There was a problem hiding this comment.
Should I just move the entire file?
There was a problem hiding this comment.
No, it just needs to be removed from the .travis.yml file
There was a problem hiding this comment.
Don't we want to run those tests? Or no?
There was a problem hiding this comment.
While Series is unimplemented, we are importing it directly from Pandas. Skipping these tests for now as they expect NotImplementedErrors to be thrown.
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
I know that this error is concordant with Pandas, but it feels very out of place in this context. Perhaps we should change it to something more descriptive. Honestly, this case seems like a bug with Pandas, it should handle the case where all columns are dropped.
Looping in @devin-petersohn here.
There was a problem hiding this comment.
I agree -- I originally had a more descriptive error but scrapped it in favor of using the pandas one. Something that references mixed dtypes would be better
kunalgosar
left a comment
There was a problem hiding this comment.
A few comments. Looping in @devin-petersohn to discuss the error message shown.
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
As before use axis = pd.DataFrame()._get_axis_number(axis) here
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Seems that np.datetime64 type columns are supported.
In [8]: x
Out[8]:
col1
0 2018-05-04 20:50:29
1 2018-05-04 20:50:29
2 2018-05-04 20:50:29
In [9]: x.dtypes
Out[9]:
col1 datetime64[ns]
dtype: object
In [10]: x.quantile(q=0.5, axis=1, numeric_only=True)
Out[10]:
0 NaN
1 NaN
2 NaN
Name: 0.5, dtype: float64
There was a problem hiding this comment.
Made a fix to handle all datetime objects
python/ray/dataframe/dataframe.py
Outdated
There was a problem hiding this comment.
Why is this function defined twice?
There was a problem hiding this comment.
One returns a Series the other returns a DataFrame
|
Test PASSed. |
be8b6ed to
39e3320
Compare
|
Test FAILed. |
|
Test PASSed. |
|
Test PASSed. |
|
Merged, thanks @11rohans! |
* master: (21 commits) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973) Clean up syntax for supported Python versions. (ray-project#1963) [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956) [DataFrame] Fix dtypes (ray-project#1930) keep_dims -> keepdims (ray-project#1980) add pthread linking (ray-project#1986) [DataFrame] Add layer of abstraction to allow OID instantiation (ray-project#1984) ...
* master: (25 commits) [DataFrame] Add direct pandas imports for MVP (ray-project#1960) Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973) Clean up syntax for supported Python versions. (ray-project#1963) [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956) [DataFrame] Fix dtypes (ray-project#1930) keep_dims -> keepdims (ray-project#1980) ...
* master: [DataFrame] Add direct pandas imports for MVP (ray-project#1960) Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973)
Updates the
DataFrame.quantilemethod for full use cases.Also updates the init file to account for datetime methods.
Updates equals method so that it can handle one column dataframes correctly